Skip to content

fix: use read lock for clone reuse check to unblock parallel plans#6376

Open
matthewmrichter wants to merge 9 commits intorunatlantis:mainfrom
matthewmrichter:matthewmrichter/optimistic-clone-read-lock
Open

fix: use read lock for clone reuse check to unblock parallel plans#6376
matthewmrichter wants to merge 9 commits intorunatlantis:mainfrom
matthewmrichter:matthewmrichter/optimistic-clone-read-lock

Conversation

@matthewmrichter
Copy link
Copy Markdown

Summary

  • Adds an optimistic read-lock fast path in Clone() so parallel plan goroutines don't serialize on the shared clone directory write lock
  • When the clone directory already exists and is at the correct commit (the common autoplan case), the check now runs under a read lock instead of an exclusive write lock
  • Falls through to the existing write-lock path unchanged when cloning or updating is needed

Problem

When parallel_plan: true is enabled, Atlantis dispatches project plans as concurrent goroutines via sizedwaitgroup. However, each goroutine calls Clone() which unconditionally acquires an exclusive write lock (RWMutex.Lock()) on the shared clone directory before checking if the repo is at the correct commit.

Meanwhile, runSteps() holds a read lock (RWMutex.RLock()) on the same directory during plan execution. Since RWMutex.Lock() waits for all readers to release, each project's Clone() blocks until the previous project's plan finishes:

Goroutine A:  Clone() [WRITE LOCK] → check → release → runSteps() [READ LOCK] → plan...
Goroutine B:  Clone() [WRITE LOCK] → BLOCKED by A's read lock → waits...
Goroutine C:  Clone() [WRITE LOCK] → BLOCKED → waits...

This serializes parallel plans to effectively 1 at a time, regardless of pool size.

Fix

Add a fast path before the write lock: check if the directory exists and is at the correct commit under a read lock. isBranchAtTargetRef() only runs git rev-parse which is read-only and safe under a shared lock.

Goroutine A:  Clone() [READ LOCK] → at correct commit → return → runSteps() [READ LOCK] → plan...
Goroutine B:  Clone() [READ LOCK] → at correct commit → return → runSteps() [READ LOCK] → plan...
Goroutine C:  Clone() [READ LOCK] → at correct commit → return → runSteps() [READ LOCK] → plan...

All goroutines can now pass the fast path simultaneously and proceed to plan in parallel.

If the fast path fails (directory doesn't exist, wrong commit, error), execution falls through to the existing write-lock path which re-checks via attemptReuseCloneDir() — no TOCTOU race.

Testing

All existing clone tests pass. The fast path is exercised by TestClone_MasterHasDiverged which logs "repo is at correct commit ... (fast path)".

Fixes #5364
Relates to #1618

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 9, 2026 15:21
@github-actions github-actions Bot added the go Pull requests that update Go code label Apr 9, 2026
@matthewmrichter
Copy link
Copy Markdown
Author

This PR addresses the root cause behind #5364 and #1618.

The core issue is that Clone() in working_dir.go unconditionally acquires an exclusive write lock on the shared clone directory, even when the directory already exists at the correct commit. Since runSteps() holds a read lock during plan execution, and RWMutex.Lock() waits for all readers to release, parallel plan goroutines serialize — each one must wait for the previous project's plan to finish before its Clone() can proceed.

The fix adds a read-lock fast path for the common case (directory exists, correct commit), allowing all goroutines to pass through simultaneously.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements an optimistic read-lock fast path in the Clone() method to unblock parallel plans. When the clone directory already exists and is at the correct commit (the common case for parallel plans), the check now runs under a read lock instead of an exclusive write lock. This allows concurrent plan goroutines to pass the clone check simultaneously instead of serializing on the write lock. If the fast path fails, execution falls through to the existing write-lock path for cloning or updating.

Changes:

  • Add read-lock fast path for the common case (directory exists and is up-to-date)
  • Move wrappedGitContext declaration earlier to support both fast and slow paths
  • Preserve existing write-lock path as fallback for clone/update operations

@matthewmrichter
Copy link
Copy Markdown
Author

Root Cause: introduced in v0.41.0 by PR #6230

The serialization behavior was introduced by commit 7758ed6c ("fix: more comprehensive fencing of git operations", PR #6230), merged 2026-03-18 and first released in v0.41.0.

That PR made two changes that together create the serialization:

  1. Clone() upgraded from sync.Mutex to sync.RWMutex write lock — in v0.40.0 and earlier, Clone() used a simple sync.Mutex (cloneLocks). Once released, plans ran freely. In v0.41.0, it was changed to an RWMutex write lock (gitWriteLock).

  2. runSteps() now acquires a read lock — a new GitReadLock() call was added to project_command_runner.go:runSteps(), held for the entire duration of plan/apply step execution.

The combination creates a deadlock-like serialization: Clone() needs a write lock but runSteps() holds a read lock, and RWMutex.Lock() waits for all readers to release. So each project's Clone() blocks until the previous project's plan steps finish.

Prior to v0.41.0, parallel_plan: true worked as expected. After upgrading to v0.41.0, parallel plans serialize to 1 at a time. We confirmed this empirically: analyzing Datadog logs across 100 plan completions showed a hard cap of 2 concurrent plans (the brief overlap during lock handoff), compared to the configured pool size of 6.

The fix in this PR adds a read-lock fast path in Clone() for the common case (directory exists, correct commit), allowing all goroutines to proceed in parallel without ever contending on the write lock.

@matthewmrichter
Copy link
Copy Markdown
Author

Safety analysis — this fix preserves the protections from #6230

CC @pseudomorph @jamengual @lukemassa — as author/reviewers of #6230 which introduced the git fencing.

#6230 solved two real problems with checkout-strategy=merge and parallel plans:

  1. git reset --hard origin/main running while TF steps execute — files change mid-operation, causing inconsistent plan/apply state
  2. Concurrent git fetch corruption — simultaneous HasDiverged calls cause ref errors and false divergence detection

This PR's fast path is safe because it only fires when no git mutation is needed:

  • isBranchAtTargetRef() runs git rev-parse (read-only) under a read lock
  • If the directory is already at the correct commit, we return immediately — no git reset, git merge, or git fetch occurs
  • If the check fails for any reason (wrong commit, error, merge strategy divergence), we fall through to the existing write-lock path unchanged
  • runSteps() still holds a read lock during plan/apply execution, preventing any concurrent git mutations

Merge strategy is handled correctly: isBranchAtTargetRef() checks HEAD^2 for merge strategy. If the base branch has diverged and needs re-merging, HEAD^2 won't match the PR head commit, so the fast path correctly fails and the full write-lock path (with updateToRefgit reset + git merge) executes as before.

The pre-existing gap between Clone() and runSteps() (where MergeAgain(), file stat, and validation run without holding a lock) is unchanged by this PR — it exists in the current code regardless.

Net effect: parallel plans that don't need git mutations (the common autoplan case) can run concurrently, while plans that do need mutations still serialize safely through the write lock.

@pseudomorph
Copy link
Copy Markdown
Contributor

This seems sensible to me. Will need the others to chime in for approval.

lukemassa
lukemassa previously approved these changes Apr 11, 2026
Copy link
Copy Markdown
Contributor

@lukemassa lukemassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Apr 11, 2026
// Fast path: if the directory already exists and is at the right commit,
// verify under a READ lock so we don't block concurrent plan operations.
// isBranchAtTargetRef only runs "git rev-parse" which is read-only.
if _, err := os.Stat(cloneDir); err == nil {
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pre-existing pattern, not introduced by this PR. The cloneDir variable is constructed by w.cloneDir() (line 100) which builds a path under Atlantis's managed data directory — this same variable is already used in os.Stat(), os.RemoveAll(), os.MkdirAll(), and git clone throughout the existing code (attemptReuseCloneDir, forceClone, etc.).

This PR adds one new os.Stat(cloneDir) call on line 106 as an early check before the existing code paths. The path construction and all downstream filesystem operations are unchanged.

The values that feed into cloneDir (repo name, PR number, workspace) come from GitHub webhook payloads which are validated by Atlantis's repo allowlist (ATLANTIS_REPO_ALLOWLIST) and GitHub's own repo naming constraints. Path traversal is not possible through these values.

@pseudomorph
Copy link
Copy Markdown
Contributor

pseudomorph commented Apr 15, 2026

I think this issue might also be triggered with MergeAgain.

We've been running an image with this fix in. So far, so good. I'll put a fix together for the MergeAgain issue and test it out. If it works, I'll put the patch up as a review comment.

@pseudomorph
Copy link
Copy Markdown
Contributor

Confirmed and fixed.

This is what I used:

❯ git diff upstream/main..HEAD server/events/working_dir.go | cat
diff --git a/server/events/working_dir.go b/server/events/working_dir.go
index ecf7c6e0..160892e8 100644
--- a/server/events/working_dir.go
+++ b/server/events/working_dir.go
@@ -157,6 +169,12 @@ func (w *FileWorkspace) attemptReuseCloneDir(logger logging.SimpleLogging, c wra
 // MergeAgain merges again with upstream if we are using the merge checkout strategy,
 // and upstream has been modified since we last checked.
 // It returns a flag indicating whether we had to merge with upstream again.
+//
+// Locking strategy: recheckDiverged uses the read and ref lock to serialize git metadata
+// operations (URL updates, fetch, status) without taking the repo write lock.
+// The write lock is only acquired when we actually need to merge.
 func (w *FileWorkspace) MergeAgain(
 	logger logging.SimpleLogging,
 	headRepo models.Repo,
@@ -168,28 +186,31 @@ func (w *FileWorkspace) MergeAgain(
 	}
 
 	cloneDir := w.cloneDir(p.BaseRepo, p, workspace)
-	// We atomically set the recheckRequiredMap flag here before grabbing the clone lock.
-	// If the flag is cleared after we grab the lock, it means some other thread
-	// did the necessary work late enough that we do not have to do it again.
+	// We atomically set the recheckRequiredMap flag here before grabbing any lock.
+	// If the flag is cleared after we grab the write lock, it means some other
+	// thread did the necessary work late enough that we do not have to do it again.
 	recheckRequiredMap.Store(cloneDir, struct{}{})
 
-	// Unconditionally wait for the clone lock here, if anyone else is doing any clone
-	// operation in this directory, we wait for it to finish before we check anything.
+	c := wrappedGitContext{cloneDir, headRepo, p}
+
+	if !w.recheckDiverged(logger, p, headRepo, cloneDir) {
+		recheckRequiredMap.Delete(cloneDir)
+		return false, nil
+	}
+
+	// Diverged — acquire the write lock to perform the merge, which mutates the
+	// working tree (git reset --hard + git merge).
 	gitWriteUnlockFn := w.gitWriteLock(cloneDir)
 	defer gitWriteUnlockFn()
 
 	if _, exists := recheckRequiredMap.Load(cloneDir); !exists {
-		logger.Debug("Skipping upstream check. Some other thread has done this for us")
+		logger.Debug("Skipping upstream merge. Some other thread has done this for us")
 		return false, nil
 	}
 	recheckRequiredMap.Delete(cloneDir)
 
-	c := wrappedGitContext{cloneDir, headRepo, p}
-	if w.recheckDiverged(logger, p, headRepo, cloneDir) {
-		logger.Info("base branch may have been updated, using merge strategy and will merge again")
-		return true, w.mergeAgain(logger, c)
-	}
-	return false, nil
+	logger.Info("base branch may have been updated, using merge strategy and will merge again")
+	return true, w.mergeAgain(logger, c)
 }
 
 // recheckDiverged returns true if the branch we're merging into has diverged
@@ -199,16 +220,20 @@ func (w *FileWorkspace) MergeAgain(
 // and we have to perform a new merge.
 // If there are any errors we return true since we prefer to assume divergence
 // for safety.
-// Locks are acquired by the caller.
+// Acquires gitRefLock internally to serialize git metadata operations.
+// Does NOT require the caller to hold the repo read or write lock.
 func (w *FileWorkspace) recheckDiverged(logger logging.SimpleLogging, p models.PullRequest, headRepo models.Repo, cloneDir string) bool {
 	if !w.CheckoutMerge {
-		// It only makes sense to warn that main has diverged if we're using
-		// the checkout merge strategy. If we're just checking out the branch,
-		// then it doesn't matter what's going on with main because we've
-		// decided to always run off the branch.
 		return false
 	}
 
+	// Hold the ref and read locks for the entire sequence: URL updates write .git/config,
+	// remote update writes .git/refs/remotes/, and hasDiverged does fetch+status.
+	// All of these touch shared git metadata and must be serialized with concurrent HasDiverged callers.
+	unlockRef := w.gitRefLock(cloneDir)
+	defer unlockRef()
+	gitReadUnlockFn := w.gitReadLock(cloneDir)
+	defer gitReadUnlockFn()
 	// Bring our remote refs up to date.
 	// Reset the URL in case we are using github app credentials since these might have
 	// expired and refreshed and the URL would now be different.
@@ -237,9 +262,6 @@ func (w *FileWorkspace) recheckDiverged(logger logging.SimpleLogging, p models.P
 		}
 	}
 
-	// We already hold the write lock; take ref lock so fetch is serialized with other HasDiverged callers.
-	unlockRef := w.gitRefLock(cloneDir)
-	defer unlockRef()
 	return w.hasDiverged(logger, cloneDir)
 }
 
@@ -251,7 +273,7 @@ func (w *FileWorkspace) HasDiverged(logger logging.SimpleLogging, cloneDir strin
 	}
 
 	// Hold ref lock and repo read lock for the full duration (fetch + status) so we don't race with
-	// clone/reset/merge. recheckDiverged does not take the read lock because it already holds the write lock.
+	// clone/reset/merge.
 	unlockGitRefLock := w.gitRefLock(cloneDir)
 	defer unlockGitRefLock()
 
@@ -265,8 +287,8 @@ func (w *FileWorkspace) HasDiverged(logger logging.SimpleLogging, cloneDir strin
 }
 
 // hasDiverged runs fetch and git status to detect divergence. Caller must hold
-// gitRefLock(cloneDir); if not already holding the repo write lock (e.g. from recheckDiverged),
-// caller must also hold gitReadLock(cloneDir).
+// gitRefLock(cloneDir) and gitReadLock(cloneDir) (or the write lock)
 func (w *FileWorkspace) hasDiverged(logger logging.SimpleLogging, cloneDir string) bool {
 	logger.Debug("HasDiverged: running git fetch in %s", cloneDir)
 	cmd := exec.Command("git", "fetch")
@@ -360,8 +382,8 @@ func (w *FileWorkspace) hasDivergedForPatterns(logger logging.SimpleLogging, clo
 	for _, file := range nonEmptyChangedFiles {
 		match, err := pm.MatchesOrParentMatches(file)
 		if err != nil {
-			logger.Debug("HasDiverged: match error for file %q: %s, treating as diverged", file, err)
-			return true
+			logger.Debug("HasDiverged: match error for file %q: %s", file, err)
+			continue
 		}
 		if match {
 			logger.Debug("HasDiverged: file %q matched patterns - branch has diverged", file)

@matthewmrichter
Copy link
Copy Markdown
Author

matthewmrichter commented Apr 17, 2026

I can patch in your changes to this PR if you want @pseudomorph, thanks!

Also if anyone has a suggestion to overcome the seemingly false positive CodeQL check, let me know.

@matthewmrichter matthewmrichter force-pushed the matthewmrichter/optimistic-clone-read-lock branch from c751477 to 85ecf2b Compare April 17, 2026 12:49
@matthewmrichter
Copy link
Copy Markdown
Author

Applied your patch — pushed as a follow-up commit. Tests pass locally. Thanks @pseudomorph for extending the fast-path pattern to MergeAgain() and catching the hasDivergedForPatterns issue.

Summary of the second commit:

  • MergeAgain() now calls recheckDiverged() first, only acquiring the write lock when a merge is actually needed
  • recheckDiverged() acquires its own gitRefLock + gitReadLock internally (explicit contract)
  • hasDivergedForPatterns() continues on per-file match errors instead of returning true

@jamengual
Copy link
Copy Markdown
Contributor

@matthewmrichter could you fix the conflicts?

matthewmrichter and others added 2 commits April 21, 2026 14:39
…l plans

When parallel_plan is enabled, all project goroutines call Clone() which
acquires an exclusive write lock on the shared clone directory before
checking if the repo is at the correct commit. Meanwhile, runSteps()
holds a read lock on the same directory for plan execution. Since
RWMutex.Lock() waits for all readers to release, each project's Clone()
blocks until the previous project's plan completes — serializing parallel
plans to effectively 1 at a time.

Fix: add an optimistic fast path that checks the commit under a READ lock
first (git rev-parse is read-only). If the directory exists and is at the
correct commit, return immediately without ever acquiring the write lock.
Only fall through to the write-lock path when the directory needs cloning
or updating.

This enables true parallel plan execution for the common case where the
clone directory is already checked out at the right commit (e.g., autoplan
on push, re-plan via comment).

Fixes runatlantis#5364
Relates to runatlantis#1618

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Matt Richter <matthew.richter@reserv.com>
Restructures MergeAgain() to check divergence first under read locks,
only acquiring the write lock when a merge is actually needed. This
mirrors the Clone() optimization in this PR and eliminates the second
serialization point for parallel plans using merge checkout strategy.

Changes:
- MergeAgain() now calls recheckDiverged() first; write lock acquired
  only when divergence is detected and a merge must be performed.
- recheckDiverged() now acquires its own gitRefLock + gitReadLock
  internally, making the locking contract explicit rather than relying
  on an implicit "caller holds the write lock" convention.
- hasDivergedForPatterns() now continues on per-file match errors
  instead of returning true (avoids false positives from transient
  pattern errors forcing unnecessary merges).

Signed-off-by: Matt Richter <matthew.richter@reserv.com>
@matthewmrichter matthewmrichter force-pushed the matthewmrichter/optimistic-clone-read-lock branch from 85ecf2b to 39875d3 Compare April 21, 2026 18:42
@matthewmrichter
Copy link
Copy Markdown
Author

@matthewmrichter could you fix the conflicts?

Done!

@jamengual
Copy link
Copy Markdown
Contributor

@pseudomorph one more review before we merge?

Comment thread server/events/working_dir.go Outdated
Comment thread server/events/working_dir.go
Co-authored-by: Ross Strickland <rkstrickland@gmail.com>
Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
Comment thread server/events/working_dir.go Outdated
Comment thread server/events/working_dir.go Outdated
Comment thread server/events/working_dir.go
@nvanheuverzwijn
Copy link
Copy Markdown
Contributor

@matthewmrichter I added suggestion if you can merge them

matthewmrichter and others added 2 commits April 24, 2026 12:27
Co-authored-by: Nicolas Vanheuverzwijn <nicolas.vanheu@gmail.com>
Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
Co-authored-by: Nicolas Vanheuverzwijn <nicolas.vanheu@gmail.com>
Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
Comment thread server/events/working_dir.go Outdated
@nvanheuverzwijn
Copy link
Copy Markdown
Contributor

hey @matthewmrichter , if you could merge my suggested changes, I messed up the lint. After that, I think we should merge that asap.

Co-authored-by: Nicolas Vanheuverzwijn <nicolas.vanheu@gmail.com>
Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
@jamengual
Copy link
Copy Markdown
Contributor

jamengual commented Apr 24, 2026

the tests will need to be fixed. @matthewmrichter

@GenPage GenPage enabled auto-merge (squash) April 26, 2026 20:39
@GenPage
Copy link
Copy Markdown
Member

GenPage commented Apr 26, 2026

I enable auto-merge once tests are fixed. A maintainer will need to re-approve/run the tests once the tests are fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code lgtm This PR has been approved by a maintainer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an option to keep plans parallel but run only inits sequentially

8 participants